feat(index): thread DataFusion MemoryPool through IVF index build pipeline#7312

Draft
wjones127 wants to merge 1 commit into lance-format:main from wjones127:worktree-serialized-hopping-bear
@wjones127

Closes #7305. Part of epic #7301.

What

Threads a per-build Arc<dyn MemoryPool> through the IVF vector index build pipeline (shuffle phase + per-partition sub-index construction).

utils.rs — new make_index_memory_pool(): reads the LANCE_INDEX_MEMORY_BUDGET env var (in bytes). Returns a GreedyMemoryPool capped at that limit, or an UnboundedMemoryPool when the variable is unset, which leaves existing behavior unchanged.
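A minimal sketch of the env-var handling described above, using only the standard library. The helper names `parse_budget` and `budget_from_env` are hypothetical, and treating an unparsable value the same as an unset one is an assumption; the PR text does not say how invalid values are handled. In the real code the resulting `Option<usize>` would select between DataFusion's `GreedyMemoryPool` and `UnboundedMemoryPool`.

```rust
use std::env;

/// Hypothetical helper: interpret the raw env var value as a byte count.
/// Assumption: anything that does not parse as `usize` counts as "no budget".
fn parse_budget(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
}

/// Hypothetical wrapper: read the budget from the environment.
/// `None` corresponds to the unbounded (existing default) behavior.
fn budget_from_env() -> Option<usize> {
    parse_budget(env::var("LANCE_INDEX_MEMORY_BUDGET").ok().as_deref())
}
```

Separating the parsing from the environment read keeps the parsing logic testable without mutating process-wide state.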

shuffler.rs — create_ivf_shuffler gains a memory_budget: Option<usize> parameter. When set, TwoFileShuffler::batch_size_bytes is sized as max(budget / 2, 128 MiB) instead of the fixed 128 MiB default.
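The sizing rule can be sketched as follows. The function name `batch_size_from_budget` comes from this PR, but the signature shown here (taking an `Option<usize>`) and the constant name `DEFAULT_BATCH_SIZE_BYTES` are assumptions made for illustration.

```rust
/// Assumed name for the fixed 128 MiB default mentioned in the description.
const DEFAULT_BATCH_SIZE_BYTES: usize = 128 * 1024 * 1024;

/// Sketch of the rule: half of the budget, but never below the 128 MiB floor.
/// With no budget, fall back to the fixed default.
fn batch_size_from_budget(budget: Option<usize>) -> usize {
    match budget {
        Some(bytes) => (bytes / 2).max(DEFAULT_BATCH_SIZE_BYTES),
        None => DEFAULT_BATCH_SIZE_BYTES,
    }
}
```

Under this rule a 1 GiB budget yields 512 MiB batches, while any budget at or below 256 MiB stays at the 128 MiB floor.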

builder.rs — new IvfIndexBuilder::with_memory_pool builder method. Each per-partition build acquires a MemoryReservation before loading partition data. Per the issue spec, try_grow returning Err is the spill signal; for now the build logs a warning and continues (the actual spill reaction is deferred to #7300).
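The reserve, warn, and continue flow can be illustrated with a small stand-in pool. `ToyGreedyPool` and `build_partition` are hypothetical names; the sketch does not use DataFusion's `MemoryPool` or `MemoryReservation` types, it only mimics a greedy pool that rejects growth past a fixed limit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a greedy pool: grants requests until the limit is hit.
struct ToyGreedyPool {
    limit: usize,
    used: AtomicUsize,
}

impl ToyGreedyPool {
    fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { limit, used: AtomicUsize::new(0) })
    }

    /// Atomically reserve `bytes`, or fail without changing the pool.
    fn try_grow(&self, bytes: usize) -> Result<(), String> {
        self.used
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |used| {
                let new = used.checked_add(bytes)?;
                (new <= self.limit).then_some(new)
            })
            .map(|_| ())
            .map_err(|used| format!("cannot reserve {bytes} bytes: {used}/{} in use", self.limit))
    }

    /// Return previously reserved bytes to the pool.
    fn shrink(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::Relaxed);
    }
}

/// Sketch of one per-partition build: reserve first, warn on failure, continue anyway.
/// Returns whether the reservation was granted.
fn build_partition(pool: &ToyGreedyPool, partition_bytes: usize) -> bool {
    let reserved = match pool.try_grow(partition_bytes) {
        Ok(()) => true,
        Err(e) => {
            // Spill signal: for now it is only logged, mirroring the deferred behavior.
            eprintln!("warning: memory budget exceeded, continuing without spill: {e}");
            false
        }
    };
    // ... load partition data and build the sub-index here ...
    if reserved {
        pool.shrink(partition_bytes);
    }
    reserved
}
```

Releasing the reservation only when it was granted keeps the pool's accounting consistent even though the build proceeds either way.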

vector.rs / ivf.rs — all four IVF build entry points (build_distributed_vector_index, build_vector_index, build_vector_index_incremental, optimize_vector_indices_v2) call make_index_memory_pool(), pass the budget to create_ivf_shuffler, and call .with_memory_pool(pool) on every IvfIndexBuilder.

Testing

New unit tests for batch_size_from_budget and create_ivf_shuffler batch-size-from-budget behavior. All 135 index::vector tests pass.

…eline

Implements the pool-threading part of lance-format#7305.

- Add `make_index_memory_pool()` in `utils.rs`: reads `LANCE_INDEX_MEMORY_BUDGET`
  env var (bytes); returns a `GreedyMemoryPool` at that limit or an unbounded pool
  (existing default behavior).
- Add `memory_budget` parameter to `create_ivf_shuffler`; when set, sizes
  `TwoFileShuffler::batch_size_bytes` via `batch_size_from_budget(budget)` (50%
  of budget, floor 128 MiB).
- Add `IvfIndexBuilder::with_memory_pool`; each per-partition build acquires a
  `MemoryReservation` before loading data. `try_grow` returning `Err` is logged as
  a warning (actual spill reaction is deferred to lance-format#7300).
- Wire `make_index_memory_pool` through all four IVF build entry points:
  `build_distributed_vector_index`, `build_vector_index`, and
  `build_vector_index_incremental` in `vector.rs`, and `optimize_vector_indices_v2`
  in `ivf.rs`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels Jun 17, 2026
Development

Successfully merging this pull request may close these issues.

Thread MemoryPool through vector IVF index build (incl. shuffler batch sizing)

1 participant